GapMinder CO2 Emissions Data - Carbon Dioxide Analysis and Driver Exploration

Udacity Data Analytics Nano Degree | Project 1 - Data Wrangling

By: Cory Robbins | Jan 15, 2021

Table of Contents

Introduction

In this project we investigate the Gapminder CO2 emissions data alongside other variables: life expectancy, population, GDP, and income, all of which are known drivers of CO2 emissions. We then compare these with data on airline carrier departures and passenger flights to help us understand CO2 emissions.

Gapminder Data sets used

In late 2020, Airbus announced three new hydrogen-powered, "zero-emission" concept aircraft, each representing a different approach to exploring technology pathways towards meeting future climate-neutral targets set by the company and in accordance with European Green Deal standards. The company expressed that the time is ripe for new development in areas such as hydrogen to take hold, so that companies such as Airbus, the wider industry, and governments can work together to meet innovation requirements and move towards an entirely new way to fly by 2030.

https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en

We hope to gain insights into known drivers of CO2 emissions across countries and regions. This includes identifying potential connections between known driving indicators of increased CO2 emissions and flight data, and seeing how total passengers and total departures relate to those indicators.

Naturally, I wanted to check whether there is any potential relationship between flights and CO2 emissions, but before analyzing flights we pulled in the known variables: life expectancy, GDP, population, and income.

Questions we would like to answer in this exploration are the following:

Data Wrangling

In this section of the report, the data is loaded, checked for cleanliness, and then trimmed and cleaned for analysis. Each step is documented and the cleaning decisions are justified.

importing CSV files
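
The notebook's original import cell is not reproduced here, so the following is only a minimal sketch of how a Gapminder-style CSV might be loaded with pandas. It assumes the usual wide layout (one row per country, one column per year). The in-memory `sample_csv`, the country names, and all values are made up for illustration; in practice a file path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a Gapminder CSV export (wide format:
# one row per country, one column per year). All values are made up.
sample_csv = io.StringIO(
    "country,2000,2001,2002\n"
    "Aland,1.0,1.1,1.2\n"
    "Borduria,5.0,,5.4\n"
)

# index_col="country" makes the country name the row index,
# leaving only the year columns as data. The empty field becomes NaN.
co2 = pd.read_csv(sample_csv, index_col="country")

print(co2.shape)
print(co2)
```

Reading each indicator the same way gives one wide table per indicator, all indexed by country, which makes the later reshaping steps uniform.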

including a list of continents, with countries grouped by continent for further analysis

transforming the data

Now we want a flattened (tall and skinny) representation of the data, to be more similar to other online resources.

For this we need the stack / unstack operations.

stacking the data
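
As a rough illustration of the stack / unstack operations, the sketch below turns a small wide table into a tall Series and back again. The table, its country names, and its values are invented for the example; only `DataFrame.stack` and `Series.unstack` are the actual pandas operations being demonstrated.

```python
import pandas as pd

# Made-up wide table: rows are countries, columns are years.
wide = pd.DataFrame(
    {"2000": [1.0, 5.0], "2001": [1.1, 5.2]},
    index=pd.Index(["Aland", "Borduria"], name="country"),
)
wide.columns.name = "year"

# stack() moves the year columns into the row index, giving a
# Series indexed by (country, year): the "tall and skinny" form.
tall = wide.stack()
tall.name = "co2"

# unstack("year") reverses the operation and restores the wide layout.
back = tall.unstack("year")

print(tall)
```

Naming the index levels (`country`, `year`) before stacking keeps the tall form self-describing, which helps once several indicators are combined.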

combining the data into its own DataFrame
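
One possible way to combine several stacked indicators into a single DataFrame is `pd.concat` along the column axis, which aligns the Series on their shared (country, year) index. This is a sketch under that assumption, not necessarily the approach the notebook took; the helper `to_tall`, the two sample tables, and their values are hypothetical.

```python
import pandas as pd


def to_tall(wide, name):
    """Stack a wide (country x year) table into a named Series."""
    wide = wide.rename_axis(index="country", columns="year")
    return wide.stack().rename(name)


# Made-up wide tables for two indicators.
countries = ["Aland", "Borduria"]
co2_wide = pd.DataFrame({"2000": [1.0, 5.0], "2001": [1.1, 5.2]}, index=countries)
gdp_wide = pd.DataFrame({"2000": [900.0, 4000.0], "2001": [950.0, 4100.0]}, index=countries)

# axis=1 places each indicator in its own column, aligned on (country, year).
data = pd.concat(
    [to_tall(co2_wide, "co2"), to_tall(gdp_wide, "gdp")],
    axis=1,
)

print(data)
```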

NaN 'not a number'

using .isna() and .sum(), and comparing the result with the shape, to analyze the number of missing values
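
A minimal sketch of the missing-value check: `.isna().sum()` counts the missing entries per column, and dividing by the row count from `.shape` expresses them as a share of the data. The small `data` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up combined table with a few missing values.
data = pd.DataFrame(
    {
        "co2": [1.0, np.nan, 5.0, 5.2],
        "gdp": [900.0, 950.0, np.nan, np.nan],
    }
)

# Number of missing values in each column.
missing = data.isna().sum()

# Share of rows that are missing, using the row count from .shape.
share = missing / data.shape[0]

print(missing)
print(share)
```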

missingno is a missing data visualization module for Python

Exploratory Data Analysis

In this section we explore the data in different ways, computing statistics and creating visualizations with the goal of addressing the research questions posed in the Introduction section.

It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

Top-25 CO2 emitters per capita
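
A ranking of this kind can be sketched by averaging each country's per-capita emissions over the years and taking the largest values with `nlargest`. The sample data is invented and has only three countries, so `top_n` is set to 2 here; with the full data set it would be 25. Ranking by the multi-year mean is an assumption, since the notebook's exact ranking rule is not shown.

```python
import pandas as pd

# Made-up tall table of CO2 emissions per capita.
data = pd.DataFrame(
    {
        "country": ["Aland", "Aland", "Borduria", "Borduria", "Carpania", "Carpania"],
        "year": [2000, 2001, 2000, 2001, 2000, 2001],
        "co2": [1.0, 1.2, 5.0, 5.4, 3.0, 2.8],
    }
)

top_n = 2  # would be 25 on the full data set

# Mean emissions per country across all years (assumed ranking rule).
mean_co2 = data.groupby("country")["co2"].mean()

# The top_n countries with the highest mean emissions, largest first.
top_emitters = mean_co2.nlargest(top_n)

print(top_emitters)
```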

Fastest-growing countries among producers of carbon dioxide per capita
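
The notebook does not state how growth is measured, so the sketch below assumes one simple definition: the relative change in per-capita emissions between the first and last available year. The data, the country names, and the cut-off of 2 (25 in the full analysis) are illustrative only.

```python
import pandas as pd

# Made-up tall table of CO2 emissions per capita.
data = pd.DataFrame(
    {
        "country": ["Aland", "Aland", "Borduria", "Borduria", "Carpania", "Carpania"],
        "year": [2000, 2001, 2000, 2001, 2000, 2001],
        "co2": [1.0, 1.2, 5.0, 5.4, 3.0, 2.8],
    }
)

# One row per country, one column per year.
wide = data.pivot(index="country", columns="year", values="co2")

# Assumed growth measure: relative change from the first to the last year.
first_year, last_year = wide.columns.min(), wide.columns.max()
growth = (wide[last_year] - wide[first_year]) / wide[first_year]

# Countries with the largest relative increase, largest first.
fastest = growth.nlargest(2)

print(fastest)
```

Other definitions, such as an average annual growth rate, would rank countries differently, so the choice is worth stating alongside the result.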

box plots of top-25 CO2 emitters

To quickly compare known indicators, we plotted the 25 countries with the fastest-growing CO2 emissions to understand the data better as a starting point.

What are the relationships between the various indicators?

To plot the variables in a correlation-coefficient heat map, so that we can see which variables are highly correlated with one another, we first need to group the data into means and put them into their own statistics DataFrame.

finding the means
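
Grouping into means can be sketched with `groupby(...).mean()`, which collapses the yearly rows into one row of averages per country. The name `stats` for the resulting frame, the sample data, and the choice of grouping by country are assumptions made for this example.

```python
import pandas as pd

# Made-up tall table with two indicators.
data = pd.DataFrame(
    {
        "country": ["Aland", "Aland", "Borduria", "Borduria"],
        "year": [2000, 2001, 2000, 2001],
        "co2": [1.0, 1.2, 5.0, 5.4],
        "gdp": [900.0, 1100.0, 4000.0, 4200.0],
    }
)

# One row per country holding the mean of each indicator over all years.
stats = data.groupby("country")[["co2", "gdp"]].mean()

print(stats)
```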

correlation matrix

Run correlation matrix to see how closely related the indicators are
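
A correlation matrix of the per-country means can be computed with `DataFrame.corr`, which returns pairwise Pearson coefficients by default. The `stats` frame below is a made-up stand-in; its values were chosen so that `co2` and `gdp` move together while `population` moves the opposite way, and they say nothing about the real data.

```python
import pandas as pd

# Made-up per-country means for three indicators.
stats = pd.DataFrame(
    {
        "co2": [1.0, 2.0, 3.0, 4.0],
        "gdp": [10.0, 20.0, 30.0, 40.0],
        "population": [40.0, 30.0, 25.0, 10.0],
    }
)

# Pairwise Pearson correlation coefficients, ranging from -1 to 1.
corr = stats.corr()

print(corr.round(2))
```

The resulting square matrix is symmetric with ones on the diagonal, which is the form a heat map function expects.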

Itertools

We use Python's itertools module to loop over the various combinations of variables that have a positive correlation coefficient, and plot each pair along with a fitted curve to show which variables indicate a possible or near relationship.
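
The loop over variable pairs can be sketched with `itertools.combinations`. The example below keeps only the pairs whose correlation coefficient is positive and fits a straight line to each with `np.polyfit`. The straight-line fit is a stand-in for whatever curve the notebook actually plotted, the plotting itself is left out, and the `stats` values are made up.

```python
import itertools

import numpy as np
import pandas as pd

# Made-up per-country means for three indicators.
stats = pd.DataFrame(
    {
        "co2": [1.0, 2.0, 3.0, 4.0],
        "gdp": [10.0, 20.0, 30.0, 40.0],
        "population": [40.0, 30.0, 25.0, 10.0],
    }
)

corr = stats.corr()

# Map each positively correlated pair to its fitted (slope, intercept).
fits = {}
for x, y in itertools.combinations(stats.columns, 2):
    if corr.loc[x, y] > 0:
        # Degree-1 least-squares fit: y is approximately slope * x + intercept.
        slope, intercept = np.polyfit(stats[x], stats[y], deg=1)
        fits[(x, y)] = (slope, intercept)

for pair, (slope, intercept) in fits.items():
    print(pair, round(slope, 3), round(intercept, 3))
```

`combinations` visits each unordered pair once, so a pair such as (`co2`, `gdp`) is not repeated as (`gdp`, `co2`).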

Without making definitive assumptions about the data, we only want to visualize the possible relationships so that we can form our own view of how these possible indicators may impact each other. By plotting the relationships, you can also see the skewness in the points and think about what possible outliers exist.

grouped analysis by continental or economic region

A further analysis could be made by continent and economic region.

Visualization of Data

Plotly scatter plots

We will first plot the means, after importing the plotly module, to produce various figures to represent the data.

Figure 1: Total CO2 consumption on the vertical axis, with population represented by the size of the bubble

Figure 2: Total CO2 consumption on the vertical axis, with population represented by the size of the bubble

Figure 3: Total CO2 consumption on the vertical axis, with population represented by the size of the bubble

Conclusions

In this project we investigated CO2 emissions data provided by Gapminder and explored data sets of other known indicators such as life expectancy, population, GDP, and income. We compared those to flight data, also provided by Gapminder, to help us understand how known indicators of CO2 emissions may also influence our flying behavior, and through it the aerospace industry's contribution to overall global CO2 emissions.

To conclude the Gapminder data set exploration: without implying causation from correlation, we can say that various drivers and directly related variables such as population, life expectancy, income, and GDP show a relationship with the amount of CO2 (per capita) a country emits.

Results:

  1. The data suggest there is certainly a connection between the known variables (population, life expectancy, income, and GDP) and the amount of CO2 emitted per capita per country.
  2. The data also suggest that many other factors, alongside the known drivers used in the analysis, such as the various types of industry, energy production, or deforestation (logging / forestry industry), would likely show relationships similar to the ones selected in this project.
  3. All variables, including flights and carrier flights, trend in the same direction as other drivers such as income and GDP. The data could therefore suggest that behavior is also a driving force, as higher income, life expectancy, etc. tend to coincide with a higher number of passenger flights in general, as shown in Figure 3.

Limitations:

  1. Limitations exist due to the categorical nature of the dates when analyzing the data. Therefore, no high-level statistical methodology can be used beyond basic correlations to showcase the nature of the potential relationships between the variables.
  2. Another clear limitation is that countries all develop at different rates and show cyclical data. Merging developing countries therefore causes an inconsistency when looking at regions, which may not be equal in their economic impact or regional development.
  3. The statistics used here are descriptive rather than inferential. Inferential statistics would require a more scientific approach, using controlled experiments with a hypothesis, rather than the exploratory inferences we make with our data.